31 research outputs found
Improving Zero-Shot Cross-lingual Transfer Between Closely Related Languages by Injecting Character-Level Noise
Cross-lingual transfer between a high-resource language and its dialects or closely related language varieties should be facilitated by their similarity. However, current approaches that operate in the embedding space do not take surface similarity into account. This work presents a simple yet effective strategy to improve cross-lingual transfer between closely related varieties. We propose to augment the data of the high-resource source language with character-level noise to make the model more robust towards spelling variations. Our strategy shows consistent improvements over several languages and tasks: Zero-shot transfer of POS tagging and topic identification between language varieties from the Finnic, West and North Germanic, and Western Romance language branches. Our work provides evidence for the usefulness of simple surface-level noise in improving transfer between language varieties.
Comment: ACL 202
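The abstract does not specify the exact noise operations or rates; a minimal sketch of character-level noise injection, assuming uniform random deletion, substitution, and insertion (the function name `inject_char_noise` and the per-character rate are illustrative assumptions, not the paper's implementation):

```python
import random

def inject_char_noise(text, noise_prob=0.1, seed=None):
    """Randomly delete, substitute, or insert characters.

    Hypothetical sketch of character-level noise augmentation; the
    paper's exact operations and rates may differ.
    """
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for ch in text:
        r = rng.random()
        if r < noise_prob / 3:
            continue                          # deletion: drop this character
        elif r < 2 * noise_prob / 3:
            out.append(rng.choice(alphabet))  # substitution: replace it
        elif r < noise_prob:
            out.append(ch)
            out.append(rng.choice(alphabet))  # insertion: add a random char after it
        else:
            out.append(ch)                    # keep unchanged
    return "".join(out)
```

Applied to the source-language training data, such perturbations mimic the spelling variation found in closely related varieties, so the model learns representations less sensitive to surface form.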
Reducing Gender Bias in NMT with FUDGE
Gender bias appears in many neural machine translation (NMT) models and commercial translation software. Research has become more aware of this problem in recent years, and there has been work on mitigating gender bias; however, the challenge of addressing gender bias in NMT persists. This work utilizes a controlled text generation method, Future Discriminators for Generation (FUDGE), to reduce the so-called Speaking As gender bias. This bias emerges when translating from English to a language that openly marks the gender of the speaker. We evaluate the model on MuST-SHE, a challenge set designed specifically to evaluate gender translation. The results demonstrate improvements in the translation accuracy of the feminine terms.
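FUDGE decodes from p(x_t | x_<t, a) ∝ p(x_t | x_<t) · p(a | x_≤t): at each step, every candidate token's log-probability under the base model is shifted by a discriminator's log-probability that the desired attribute (here, feminine speaker gender) will hold if that token is chosen. A toy sketch of this rescoring step with plain floats, not a real model (the function name `fudge_rescore` is illustrative):

```python
import math

def fudge_rescore(base_logprobs, attr_logprobs):
    """Combine base LM log-probs with attribute-discriminator log-probs.

    base_logprobs: per-candidate log p(x_t | x_<t) from the base model.
    attr_logprobs: per-candidate log p(a | x_<=t) from the discriminator.
    Returns the renormalized log-probs of the combined distribution.
    """
    combined = [b + a for b, a in zip(base_logprobs, attr_logprobs)]
    # renormalize so the result is a proper distribution
    z = math.log(sum(math.exp(c) for c in combined))
    return [c - z for c in combined]
```

Because the discriminator only reweights the base model's distribution at decoding time, the NMT model itself needs no retraining.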
On Biasing Transformer Attention Towards Monotonicity
Many sequence-to-sequence tasks in natural language processing are roughly
monotonic in the alignment between source and target sequence, and previous
work has facilitated or enforced learning of monotonic attention behavior via
specialized attention functions or pretraining. In this work, we introduce a
monotonicity loss function that is compatible with standard attention
mechanisms and test it on several sequence-to-sequence tasks:
grapheme-to-phoneme conversion, morphological inflection, transliteration, and
dialect normalization. Experiments show that we can achieve largely monotonic
behavior. Performance is mixed, with larger gains on top of RNN baselines.
General monotonicity does not benefit transformer multihead attention; however,
we see isolated improvements when only a subset of heads is biased towards
monotonic behavior.
Comment: To be published in: Proceedings of the 2021 Conference of the North
American Chapter of the Association for Computational Linguistics: Human
Language Technologies (NAACL-HLT 2021).
Building a Parallel Corpus on the World's Oldest Banking Magazine
We report on our processing steps to build a diachronic parallel corpus based on the world's oldest banking magazine. The magazine has been published since 1895 in German, with translations in French and partly in English and Italian. Our data sources are printed issues (until 1997), PDF issues (since 1998), and HTML files (since 2001). The corpus building poses special challenges in article boundary recognition and in cross-language article and sentence alignment. Our corpus fills a gap in parallel corpora with respect to genre (magazine articles), domain (banking and economy articles), and time span (120 years).
Findings of the VarDial Evaluation Campaign 2022
This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2022. The campaign is part of the ninth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with COLING 2022. Three separate shared tasks were included this year: Identification of Languages and Dialects of Italy (ITDI), French Cross-Domain Dialect Identification (FDI), and Dialectal Extractive Question Answering (DialQA). All three tasks were organized for the first time this year.
Findings of the VarDial Evaluation Campaign 2023
This report presents the results of the shared tasks organized as part of the VarDial Evaluation Campaign 2023. The campaign is part of the tenth workshop on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects (VarDial), co-located with EACL 2023. Three separate shared tasks were included this year: Slot and intent detection for low-resource language varieties (SID4LR), Discriminating Between Similar Languages — True Labels (DSL-TL), and Discriminating Between Similar Languages — Speech (DSL-S). All three tasks were organized for the first time this year.